ABSTRACT

Relative cost and effectiveness of techniques for preparing a computer
compatible data base consisting of approximately one million words of
natural language are outlined. Considered are dollar cost, ease of
editing, and time consumption. Facility for insertion of identifying
information within the text, and updating of a text by merging with
another text are given special attention. It is concluded that MTST
and Telterm2 are two highly effective methods of text preparation.
The decision of which to use on a particular project would depend on
available funds and possible peripheral uses for the equipment.
Criteria for making such a decision are discussed.

AN ANALYSIS OF METHODS FOR PREPARING
A LARGE NATURAL LANGUAGE DATA BASE

I. Introduction



The speed and versatility of the computer for processing natural
language text is an important aspect of current research. The
proliferation of written information has given birth to numerous
computerized information retrieval systems which must cope with
natural language. Linguists, educators, social scientists and scholars
in the humanities are closely scrutinizing language usage in order to
develop models and systems in their respective fields. As the computer
industry develops more sophisticated approaches with a "conversational
mode" between the user and the computer, the processing of the natural
language of human beings takes on new importance. With the increase in
special purpose programming languages, compiler design and development,
applications increase and more natural language techniques develop as
a by-product.

A number of problems arise in natural language or "text" processing
by a computer, many of which occur long before the data reaches the
computer. Natural language comes in many forms, none of which are
directly computer compatible. Spoken language can be recorded on
magnetic tape or printed in a book or magazine, but computers are not
yet generally equipped to take these forms as direct input to their
complex electronic circuitry. Computers need information in the form
of on-off codes, commonly referred to as "bits" and represented as
1's and 0's in groups of 6 or 8 called "bytes," with each such group
usually representing a single character. For example, the letter A
may be represented by 1000001 [1] or 11000001 [2] or 110001 [3],
depending on the computer used to process the data. Generally
speaking, computer compatible information is stored in one of three
basic forms: punched cards, punched paper tape, or magnetic tape. The
problem arises in finding an accurate, easy, and cost-effective method
of transforming voice recordings or printed text into a form directly
accessible by the computer.
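The three bit patterns quoted for the letter A can be reproduced with
a short sketch. Python is used here purely for illustration and is not
part of the study. The built-in "cp037" codec is assumed to stand in
for the EBCDIC variant intended, and the 6-bit BCD pattern is copied
from the text rather than computed.

```python
# Illustrative sketch: bit patterns for the letter "A" in three codes.
# Assumptions: Python's built-in "cp037" codec stands in for EBCDIC, and
# the 6-bit BCD pattern is copied from the text (no BCD text codec exists
# in the standard library).

def bit_pattern(char: str, encoding: str, width: int) -> str:
    """Return the code for a single character as a string of 1's and 0's."""
    value = char.encode(encoding)[0]
    return format(value, f"0{width}b")

ascii_a = bit_pattern("A", "ascii", 7)    # 7-bit ASCII
ebcdic_a = bit_pattern("A", "cp037", 8)   # 8-bit EBCDIC (code page 037)
bcd_a = "110001"                          # 6-bit BCD, as quoted in the text

for name, bits in [("ASCII", ascii_a), ("EBCDIC", ebcdic_a), ("BCD", bcd_a)]:
    print(f"{name:6s} {bits}")
```

The same function can be applied to any other character to see how the
three codes diverge.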

II. Overview of Systems

In an attempt to discover an effective method for preparing text
acceptable to the computer, a comparative study was undertaken of
devices which might be used to prepare a one million word file of
natural language text. The systems analyzed were KEYPUNCH, TELETYPE,
FLEXOWRITER, MTST, DATAPLEX, ATS, TELTERM2 (CRT), and OPTICAL CHARACTER
SCANNING. Investigation was limited to techniques immediately available
in the greater Los Angeles area, and all prices quoted are those
applicable to a non-profit research or educational institution. In
order to maintain directly parallel prices for easy comparison, any
in-house facilities (such as a keypunch machine or computer) accessible
to the researcher were not considered cost free. In each case,
commercial prices are given.

An overview of each system was made by contacting vendors and reading
the printed material available on each system. Items considered were
cost in dollars, time consumption in operation, and ease of editing.

[1] ASCII (American Standard Code for Information Interchange)

[2] EBCDIC (Extended Binary Coded Decimal Interchange Code)

[3] BCD (Binary Coded Decimal)

All cost estimates were made on the basis of one million words of
text. All terminal rentals are considered on a one-year lease basis
only; total rental figures are for one full year. Cartridge, cassette,
and disk storage costs are given on the basis of a full 5,000,000
characters being stored at one time, which could be avoided in practice
by more frequent dumping to magnetic tape if a sufficiently flexible
magnetic tape updating facility is available. Magnetic tape purchase
cost was not considered, since it would remain constant regardless of
the system employed.

Keypunch (IBM 029)

The keypunch is an off-line method for storing information on
lightweight cardboard by mechanically punching holes in the cards.
The arrangement of the holes represents coded data. A high noise
level is created by the card punching. Keying may be done on an IBM
029 keypunch machine and hard copy may be obtained by "listing" the
cards using a computer. A trained keypunch operator is required.
Keypunch training takes an average of six weeks. Typing errors made
and caught during entry are correctable by using a verifier (which
resembles a keypunch) and duplicating the card up to the error and then
typing in the correction. Edit operations such as insertions, deletions,
etc. are difficult since many cards may have to be repunched to take
care of the corrections. The best approach for insertion is to leave
a number of blanks at the end of the card, so that the card may be
duplicated up to the needed insertion, and the insertion made without
overflowing into another card. The rest of the card must be retyped,
since the old card is moved out of position for further duplication by
the typing of the insertion. Programs written to process cards produced
in such a manner must be written (or available) to treat multiple blanks.

Problems existing with storage on cards include such things as
warping of cards stored in non-standard humidity or temperature. The
bulk of the 62,500 cards [4] required to hold one million words is
enormous and reordering a deck of dropped cards can be difficult if
they are not sequence numbered. For these reasons, and others, it is
convenient to convert the cards to magnetic tape. Such conversion can
be done by using card to tape equipment.
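The approach of leaving blanks at the end of each card can be sketched
as follows. This is only an illustration under stated assumptions: the
80-column width is the standard card, while the function name and the
sample text are hypothetical.

```python
# Sketch of the "leave blanks at the end of the card" approach.
# CARD_WIDTH is the 80-column card; everything else is hypothetical.

CARD_WIDTH = 80

def insert_on_card(card: str, position: int, insertion: str) -> str:
    """Insert text at a column, using up trailing blanks on the card.

    Raises ValueError when the card has too few trailing blanks, which
    corresponds to the case where further cards would have to be repunched.
    """
    card = card.ljust(CARD_WIDTH)
    used = len(card.rstrip(" "))            # columns holding data
    if position > used:
        raise ValueError("insertion point lies in the blank area")
    if used + len(insertion) > CARD_WIDTH:
        raise ValueError("not enough trailing blanks; repunch following cards")
    new_data = card[:position] + insertion + card[position:used]
    return new_data.ljust(CARD_WIDTH)

original = "THIS GUY, I HE MIGHT, ".ljust(CARD_WIDTH)
edited = insert_on_card(original, 5, "57420 ")
print(repr(edited.rstrip()))
```

The ValueError branch marks the point at which the reserved blanks are
exhausted and the insertion spills over into other cards.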



Table 1

KEYPUNCH [5]

Item                         Breakdown                  Total

Keypunch purchase            $3490 + $28.75/mo.         $3835 ($3490 +
                             maintenance                $345/yr. maintenance)

Keypunch rental              029 - $77/mo.              $924 (12 mo. @ $77 ea.)

Storage cost (card)          $1.15/1,000 +              $92.50 (70,000 cards)
                             $12 set up

Conversion to magnetic tape  $1.00/2,500 cards          $28 (70,000 cards)

Total per year if leasing                               $1044.50

Total if purchasing                                     $3955.50



[4] Each card may have up to 80 characters. Since the million word data
base would contain approximately 5,000,000 characters, 62,500 cards
(5,000,000 ÷ 80) would be required.

[5] Prices quoted by: Mr. Bob Wright, IBM, 445 S. Figueroa, Los Angeles,
(213) 620-1830 (IBM 029 Keypunch, cards); Mr. Harold Hackman, SBC,
11300 La Cienega Boulevard, Inglewood, California 90304, (213) 776-5900
(card-tape conversion).

Teletype (Western Union Terminal 33)

The Teletype is an off-line method for storing information on paper
tape by mechanically punching holes in the tape with eight punched
positions per code. Keying may be done on a standard teletype keyboard
with simultaneous production of paper tape and hard copy.

Typing errors made and caught during entry can be corrected by the
use of "control" characters which indicate to the computer that either
the character immediately preceding or the present line should be
ignored when the paper tape is processed on-line. These codes can be
followed by the corrected copy. Edit operations such as character
insertions can theoretically be accomplished by duplicating the paper
tape, stopping it exactly at the appropriate spot, adding the insertion,
and continuing the duplication. In practice, such accuracy in positioning
the paper tape is virtually impossible, as in the case of punch cards.
Although paper tape itself is directly computer compatible, it is
bulkier than magnetic computer tape and more susceptible to problems
created by handling. Approximately 55,000 feet of paper tape [6] would
be required to store the full one million words. Conversion from paper
tape to magnetic tape can be done directly by computer.
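The effect of the "control" characters when a tape is read can be
sketched in a few lines. The two code values below are assumptions
chosen for illustration; the report does not say which codes the
Teletype or the receiving computer actually used.

```python
# Sketch of processing "control" characters when paper tape is read.
# The two code values are assumptions chosen for illustration only.

DELETE_CHAR = "\x7f"   # assumed code: ignore the preceding character
DELETE_LINE = "\x18"   # assumed code: ignore the present line so far

def apply_corrections(raw: str) -> str:
    """Return the text that remains after the control characters are obeyed."""
    lines = []
    current = []
    for ch in raw:
        if ch == DELETE_CHAR:
            if current:
                current.pop()          # drop the character just before
        elif ch == DELETE_LINE:
            current.clear()            # discard the line typed so far
        elif ch == "\n":
            lines.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        lines.append("".join(current))
    return "\n".join(lines)

tape = "HE MIGT\x7fHT\nHS INT\x18HIS INTENTIONS\n"
print(apply_corrections(tape))
```

Both corrections are made by typing on past the error, which is why
they work during entry but do not help with later insertions.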



[6] Each inch of paper tape contains 8 codes; one foot contains 96 codes.
Therefore, 5,000,000 characters would take 52,083.3 feet (5,000,000 ÷
96). A paper tape roll is 1000 feet long and contains 96 codes per foot.
Thus, 19,200 five character words could fit on one roll if no errors
were made which required extra codes. For purposes of estimating,
the figure 20,000 words per roll was used. It is significant to note
that the full million word data base would take about 55,000 feet of
paper tape, which would be cumbersome and fragile to handle. The
estimate on conversion to magnetic tape is purposely high. The time
would depend on operator and machine efficiency.
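The paper tape figures above follow from simple arithmetic, worked
here as a check. All quantities are taken from the text; only the
variable names are introduced for the illustration.

```python
# Worked arithmetic for the paper tape estimate (figures from the text).

CODES_PER_INCH = 8
ROLL_FEET = 1000
TOTAL_CHARACTERS = 5_000_000
CHARACTERS_PER_WORD = 5        # five character words, as in the footnote

codes_per_foot = CODES_PER_INCH * 12                  # 96
feet_needed = TOTAL_CHARACTERS / codes_per_foot       # about 52,083.3
words_per_roll = ROLL_FEET * codes_per_foot // CHARACTERS_PER_WORD  # 19,200

print(f"{codes_per_foot} codes per foot")
print(f"{feet_needed:,.1f} feet for {TOTAL_CHARACTERS:,} characters")
print(f"{words_per_roll:,} words per {ROLL_FEET}-foot roll")
```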



Table 2

TELETYPE [7]

Item                         Breakdown                  Total

Terminal purchase            $877 + $100 (coupler)      $1027
                             + $50 (interface
                             adapter)

Terminal rental              $55/mo. + $25/mo. for      $960 (12 mo.
                             data phone                 @ $80 ea.)

Paper tape                   $1.15/1000 ft. roll        $63.25 (55 rolls
                                                        @ $1.15 ea.)

Conversion to magnetic tape  $25/hr. of computer time   $250 (10 hrs.
                                                        @ $25 ea.)

Total if leasing                                        $1273.25

Total if purchasing                                     $1340.25



[7] Prices quoted by Mr. Jacobson, Teletype Corp., 5720 E. Washington
Boulevard, Los Angeles, (213) 724-6040 (Exchange data terminal 33
with paper tape reader/punch and coupler); Mr. Bob Bewak, Stat Tab
Data Service Center, 1519 Olympic Boulevard, Los Angeles 90015,
(213) 381-7251.

Flexowriter 2301 (Friden)

The Flexowriter is an off-line system for storing information on
punched paper tape very much like teletype. Keying is done on an
electric typewriter keyboard with paper tape punch and simultaneous
production of hard copy and paper tape with eight punch positions per
code. Typing errors made and caught during entry are correctable by
moving the paper tape back and blocking out the error with the tape
feed key. The correct code may then be entered. After-the-fact
correction such as insertions, deletions, etc. can be accomplished by
playing the punch tape back at 145 words per minute. A new paper tape
is punched simultaneously with playback, and updating can be done by
stopping the original tape and making the correction. Positioning the
tape at the appropriate place is difficult, but more easily accomplished
than with teletype. The original-to-edit process may be repeated as
often as necessary. Although the paper tape produced by the Flexowriter
is directly machine compatible, the placement of the sprocket holes
necessitates the reversing of the tape for read-in to a computer.
Read-in under such conditions causes all lower case to be capitalized
and causes problems with numeric and punctuation codes. As with
teletype, approximately 50,000 feet of paper tape would be needed for
data storage.



Table 3

FLEXOWRITER [8]

Item                         Breakdown                  Total

Terminal purchase                                       $3300

Terminal rental              $105/mo.                   $1260 (12 mo.
                                                        @ $105 ea.)

Paper tape cost              $1.15/roll (need 50)       $57.50 (50 rolls
(see comment #2)                                        @ $1.15)

Conversion                   $25/hr. of computer        $250 (10 hr.
(see comment #2)             time                       @ $25 ea.)

Total of lease                                          $1567.50

Total of purchase                                       $3607.50



[8] Prices quoted by Mr. Michael A. Jackson, Singer-Friden Division, 1720
Beverly Boulevard, Los Angeles 90026, (213) 483-4800 (Flexowriter 2301).

MTST (IBM Magnetic Tape Selectric Typewriter)

The MTST is an off-line system for storing information on magnetic
cartridges which have a capacity of approximately 24,000 characters.
Keyboarding is done on an IBM Selectric Typewriter with simultaneous
production of cartridge and hard copy. A trained MTST operator is
required for keyboarding. Initial training consists of 2-4 full days
in a special course provided free by IBM. Typing errors made and
caught during entry are correctable merely by backspacing and retyping.
For editing, the tape cartridge can be played back at 150 words per
minute, with such changes to content as insertions, deletions, and
substitutions accomplished by recording onto a second tape the material
as it is played back from the original. The corrections can be typed
by stopping the play-back at appropriate points. If "reference codes"
are inserted at intervals in the original, the search scan can run
through the tape at 10,000 words per minute, stopping at the position
after the specified reference code for insertions at that point. The
original-to-edit process may be repeated as often as necessary. The
final step is transfer from MTST cartridge directly to magnetic tape.
If further editing is required, the magnetic tape can be converted back
to MTST cartridges for use on the MTST device. Approximately 210
cartridges [9] would be needed for data storage.
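The original-to-edit pass with reference codes can be sketched as a
copy from one "tape" to another with insertions made just after a named
code. This is a simplified model, not a description of the MTST
mechanism; the "#R1#" code format and the function name are
hypothetical.

```python
# Sketch of an original-to-edit pass: copy the original to a second "tape",
# adding insertions right after named reference codes.
# The "#R1#"-style code format is hypothetical.

def edit_pass(original: str, insertions: dict[str, str]) -> str:
    """Copy `original`, placing each insertion after its reference code."""
    edited = original
    for code, text in insertions.items():
        position = edited.find(code)
        if position == -1:
            raise KeyError(f"reference code {code!r} not found")
        cut = position + len(code)
        edited = edited[:cut] + text + edited[cut:]
    return edited

first_tape = "#R1# THIS GUY, I HE MIGHT, #R2# HIS INTENTIONS MIGHT BE GOOD,"
second_tape = edit_pass(first_tape, {"#R2#": " HIS,"})
print(second_tape)
```

Because the output of one pass is an ordinary tape, it can serve as the
input to the next pass, which mirrors the repeatable original-to-edit
process described above.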




[9] Each cartridge holds 24,000 characters; therefore, 5,000,000
characters would take 208.3 cartridges (5,000,000 ÷ 24,000). The
cartridge cost would probably be lower in actual practice,
since cartridges could be used several times, thus lowering the
total number needed.

Table 4

MTST [10]

Item                         Breakdown                  Total

Terminal purchase                                       $10,035

Terminal rental              $257/mo.                   $3084 (12 mo.
                                                        @ $257 ea.)

Cartridge cost               $15 ea. (need 210          $3150 (210
                             @ 24,000 ch. ea.)          @ $15 ea.)

Conversion                   $2.50/cartridge            $355 (210
                                                        @ $2.50 ea.)

Total if leasing                                        $6589

Total if purchasing                                     $13,540



[10] Prices quoted by Mr. Chuck Zander, IBM, 9045 Lincoln Boulevard,
Los Angeles, (213) 670-8350 (MTST with code conversion and reverse
search, cartridges in lots of 150 or more); Mr. Del Seraphine,
Autographics, 751 Monterey Pass Road, Monterey Park 91754, (213)
263-2184.

Dataplex (Data Instruments)

The "Dataplex" is an off-line keying, on-line editing system, using
an IBM Selectric typewriter recording simultaneously on paper and on
cassette which holds up to 50,000 characters. The system is designed
to work with its own 12k computer, utilizing software created
specifically for a particular job. Typing errors made and caught during
entry are correctable merely by backspacing and retyping. Errors caught
before the end of the cassette can be corrected by typing correction
commands. There is no playback capability at this stage. Further editing
can be done in batch mode either by hand carrying, mailing or
tele-processing the original cassette and a corrections cassette to the
processor which converts to magnetic tape. The processor reads in at a
rate of 200 characters per second. The processing could be done by Data
Instruments on a service bureau basis, utilizing a program written by
the user or specially prepared by Data Instruments staff. The update
editing of the magnetic tape can be carried out any number of times by
submitting further correction cassettes. The complexity of the commands
required for the correction cassette would be determined by the
sophistication of the software produced. Magnetic tape is always in most
updated form and hard copy is produced by line printer at each step
after the first. Approximately 100 cassettes [11] would be required for
data storage.

Table 5

DATAPLEX [12]

Item                         Breakdown                  Total

Terminal purchase                                       $4700

Terminal rental              $104/mo.                   $1248 (12 mo.
                                                        @ $104 ea.)

Cassette cost                $5 ea. (need 100           $500 (100
                             @ 50,000 ch. ea.)          @ $5 ea.)

Updating processing          $24/hr. (need 42 hr.       $3024 (126 hr.
                             per editing pass)          @ $24 ea.)

Total if leasing                                        $4772

Total if purchasing                                     $8224



[11] Each cassette holds 50,000 characters. Therefore, 5,000,000
characters would take 100 cassettes (5,000,000 ÷ 50,000).

[12] Prices quoted by Mr. Ed Russo, Data Instruments Company, 16611 Roscoe
Place, Sepulveda, California 91343, (213) 833-6644. With possible
exception of cassette cost, costs stated are not flexible. In
addition, approximately one week minimum of programmer time must
be invested.

TELTERM2 (Delta Data Systems)

The Telterm2 is a keyboard CRT terminal with up to 3,000 characters
of built-in memory. The terminal can also be used on-line, connected by
phone lines to a computer. Data can be transmitted off-line onto a
cassette tape attachment for later further editing. Each cassette can
hold up to 50,000 characters. The CRT screen displays 27 lines of 80
characters each, and immediate roll-up and roll-down capability is
available to display other portions of the full 3,000 character memory.
Since only the data up to a carriage return is contained in memory, and
not the full 80 character line, up to 100-150 lines of text may be
readily available for editing and direct display. Errors made and caught
during entry are correctable by backspacing and retyping. Errors caught
before transmission of the full 3,000 characters of memory to the
cassette are correctable by moving the cursor to the appropriate
position and retyping.

Also available are insert and delete function keys. To insert new
material, the user merely positions the cursor at the point where the
insertion is to be made, and pushes the INSERT key. He may then type in
whatever insertion he wants, and the Telterm2 will automatically "wrap
around" the rest of the data contained in its memory. An END INSERT key
is used to complete the insertion. A similar procedure produces
deletions of either single characters or complete lines. A format mode
of data entry allows the user to specify fixed fields, and variable
fields which may have different data depending upon circumstances. Data
can be transferred in blocks at a high speed, and either the full 3,000
character memory or individual sections of memory called "messages" can
be dumped onto cassette or through a computer to magnetic tape.

If further editing is required, the material on cassette can be read
back into the Telterm2 memory and displayed on the screen. Hard copy is
available by printing the cassette onto a hard copy terminal or listing
the magnetic tape on a high speed printer. The Telterm2 can be
configured to handle two cassettes, one for input, the other for output.





Table 6

TELTERM [13]

Item                         Breakdown                  Total

Telterm2 purchase                                       $6500

Telterm2 rental              $367.50/mo.                $4410.00 (12 mo.
                                                        @ $367.50 ea.)

Cassette cost                $5 ea. (need 100           $500 (100 @ $5 ea.)
                             @ 50,000 ch. ea.)

Conversion to magnetic tape  $25/hr. of computer        $250 (10 hrs.
                             time                       @ $25 ea.)

Total if leasing                                        $5160.00

Total if purchasing                                     $7250.00



[13] Prices quoted by Mr. Larry Lohr, Data-Serv, 15114 Downey Avenue,
Paramount, California 90723, (213) 531-6161. Price includes
Mobark 400T cassette attachment ($1800). It would be possible to
use the Livermore cassette attachments ($375) with slight
inconvenience. Lease is a lease-purchase on two year purchase commitment.

ATS (Administrative Terminal System [formerly IBM's DATATEXT])

The ATS is an on-line software system for storing information on
disk and allowing direct editing through terminal-entered commands.
Significant editing capability includes formatting ability to change
number of characters per line, or number of lines per page. Typing
errors made and caught during entry are correctable merely by backspacing
and retyping. Such changes to content as insertions, deletions,
substitutions, and rearrangements of words and phrases are made on-line
by specifying the changes themselves and their locations. The user
must specify enough unique information on an item to allow for exact
scan matches, as well as specifying line number locations. ATS has
playback capability (at any point in the editing process) of 140 words
per minute on typewriter terminal, and 5,(jP lines per minute on high
speed printer. ATS (as implemented by Arcata Data Systems) also can
output to microform and/or to a computer typesetting device automatically
and directly.

In addition, the ATS system has the capability of producing, in
interactive mode, the equivalent of one line KWIC's and counts of
occurrences from which frequencies can be easily computed.
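A one-line KWIC (keyword in context) listing with occurrence counts
can be sketched as follows. This illustrates the kind of output
described, not the ATS commands themselves; the function name, the
context width, and the sample text are assumptions.

```python
# Sketch of a one-line KWIC (keyword in context) listing with counts.
# The function names and the context width are illustrative choices.
from collections import Counter

def kwic(words: list[str], keyword: str, width: int = 2) -> list[str]:
    """Return one line per occurrence: `width` words on either side."""
    lines = []
    for i, word in enumerate(words):
        if word == keyword:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines

text = "YOU KNOW BUT HE'S JUST DOING HIS JOB YOU KNOW"
words = text.split()
counts = Counter(words)
total = len(words)

for line in kwic(words, "KNOW"):
    print(line)
print(f"KNOW: {counts['KNOW']} of {total} words "
      f"({counts['KNOW'] / total:.0%})")
```

Dividing each count by the total number of words gives the relative
frequency referred to above.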

Material on disk can be dumped to magnetic tape at minimal cost
when extended periods of hand-editing are required on hard copy. Such
a procedure saves disk storage costs which are computed on a monthly
basis.

Table 7

ATS [14]

Item                         Breakdown                  Total

Terminal purchase

Terminal rental              $100/mo. (requires modem   $1200-$1500 (12 mo.
                             hookup - possible          @ $100-$125 ea.)
                             additional $25/mo.)

Storage cost                 $^5/1550 characters/mo.

Conversion to magnetic tape  $10 (on or off disk)

Total if leasing

Total if purchasing

[14] Prices quoted by Mr. Bruce Hawley, Butler Data Systems, 12911 Cerise
Avenue, Hawthorne, California 90250, (213) 772-2331. Storage figure
is shown in parallel form to cartridge and cassette cost, but is
totally unrealistic in terms of actual use of system, since in real
use, the majority of data would be kept on magnetic tape, saving
storage costs, and only dumped onto disk for short periods for
editing. A more realistic figure would be closer to $3000, giving
a total of $4210 to $4510.

Optical Character Scanning (FormScan)

OCS is an on-line scanning system using as input a typescript on
standard paper made with an IBM Selectric typewriter. Type fonts which
can be read are 1403, OCR-A and Pica. Upper and lower case capability
is available with Pica only, and a modified font must be obtained from
FormScan to preclude 0-O and 1-l confusions. The non-reflecting areas
(i.e., the black letters) are read and coded by a CRT and transferred
to the character identifying element of the system. Character
recognition is accomplished through the division of each 1/10" x 1/10"
character area into a grid with 1200 parts and software identification
of the grid areas which contain data specific to a particular character
(ex. a or j or @). Identification error is at a maximum of 1 in 25,000
characters for upper and lower case and much lower for all caps. The
system produces 7 track, 556 BPI, BCD coded tapes directly and a
conversion to 9 track, 800 BPI, EBCDIC coded tape is available for $15.
Also available is an updating feature allowing lines with errors to be
corrected by typing a sheet identifying the line number and then typing
the line as it should be.
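The quoted error rate and the per-line scanning charge can be turned
into rough totals for the one million word file. The error rate and
character count come from the text; the 15 words per line and $.01 per
line figures are those given in Table 8. The result is an upper bound
on recognition errors only and says nothing about typing errors in the
typescript.

```python
# Worked arithmetic for optical scanning, using figures from the report.

TOTAL_WORDS = 1_000_000
TOTAL_CHARACTERS = 5_000_000
WORDS_PER_LINE = 15
COST_PER_LINE = 0.01            # dollars per scanned line
ERROR_INTERVAL = 25_000         # at most 1 error in this many characters

lines = round(TOTAL_WORDS / WORDS_PER_LINE)
scan_cost = TOTAL_WORDS / WORDS_PER_LINE * COST_PER_LINE
expected_errors = TOTAL_CHARACTERS / ERROR_INTERVAL

print(f"{lines:,} lines to scan")
print(f"${scan_cost:,.2f} scanning charge")
print(f"up to {expected_errors:,.0f} misread characters")
```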



Table 8

OCS [15]

Item                         Breakdown                  Total

Terminal rental

Storage cost

Conversion from typescript   $.01/line                  $666.67 (66,667
                                                        lines of 15 words
                                                        ea. @ $.01 ea.)

Conversion from MTST         $3.20/cartridge            $666.67
typescript                   approximately

Update                       $50.00 minimum

Total of leasing                                        $666.67

Total of purchasing                                     Not applicable

[15] Prices quoted by Mr. Harley D. Hancock, FormScan Inc., 16220 Orange
Avenue, Paramount, California 90723, (213) 636-2441. Costs are
deceptive since they do not reflect cost of preparation of a perfect
typescript to be scanned.
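The lease and purchase totals in Tables 1 through 6 can be recomputed
from their component figures as a consistency check. The sketch below
copies the figures from the tables; the dictionary layout and function
name are illustrative. ATS and OCS are left out because the source
gives a range for the former and no purchase option for the latter.

```python
# Sketch: recompute the yearly lease and purchase totals from Tables 1-6.
# Figures are copied from the tables; the dictionary layout is illustrative.

SYSTEMS = {
    # name: (purchase, yearly rental, media/storage, conversion or processing)
    "KEYPUNCH":    (3835.00,  924.00,   92.50,   28.00),
    "TELETYPE":    (1027.00,  960.00,   63.25,  250.00),
    "FLEXOWRITER": (3300.00, 1260.00,   57.50,  250.00),
    "MTST":        (10035.00, 3084.00, 3150.00,  355.00),
    "DATAPLEX":    (4700.00, 1248.00,  500.00, 3024.00),
    "TELTERM2":    (6500.00, 4410.00,  500.00,  250.00),
}

def totals(name):
    """Return (total if leasing, total if purchasing) for one system."""
    purchase, rental, media, conversion = SYSTEMS[name]
    return rental + media + conversion, purchase + media + conversion

# List the systems from cheapest to most expensive yearly lease.
for name in sorted(SYSTEMS, key=lambda n: totals(n)[0]):
    lease, buy = totals(name)
    print(f"{name:12s} lease ${lease:>9,.2f}   purchase ${buy:>10,.2f}")
```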
Table 9 provides a quick comparison of the various input systems.
In cases where some systems lack capabilities found in others,
information in addition to the relative capability may be helpful. The
following comments relate to each system which lacks a specific
capability found in other systems.

1.  Upper and lower case
    OCS - upper and lower case available only with pica
    CRT - upper and lower case is optional on Telterm2

2.  Produces ASCII coded magnetic tape
    MTST - produces EBCDIC
    ATS - produces EBCDIC
    OCS - produces BCD

3.  Unlimited insertion
    KEYPUNCH - insertion limited by number of blanks left at end of
               card
    TELETYPE - insertion practically impossible
    OCS - insertion practically impossible

4.  Update facility
    KEYPUNCH - update requires card duplication and is limited by
               number of blanks left at end of card

5.  Off-line keying
    ATS - it is feasible to load a file created off-line
          onto the disc

6.  Off-line edit
    TELETYPE - computer edits using "control" characters
    DATAPLEX - computer edits using correction cassette
    ATS - computer edits using terminal input commands

7.  Backspace correct
    KEYPUNCH - requires duplicating whole card
    TELETYPE - uses "control" characters
    FLEXOWRITER - uses "control" characters

8.  Needs no line feed with carriage return
    TELETYPE - needs line feed
    CRT - needs line feed

9.  Initial playback of hard copy
    KEYPUNCH - requires computer listing of cards
    DATAPLEX - none available
    CRT - cassette can be listed on computer printer or terminal by
          phone line

10. Hard copy of edited file
    KEYPUNCH - requires computer listing of cards
    TELETYPE - paper tape must be carried to processor

11. No hand carrying to magnetic tape
    KEYPUNCH - card deck must be taken to computer location
    MTST - cartridges must be taken to converter location

12. Untrained operator
    KEYPUNCH - six weeks training
    MTST - three days provided free by IBM

13. Uses other than input preparation
    TELETYPE - can be used as terminal
    MTST - can be used for secretarial tasks
    CRT - can be used as terminal
III. Test on 300 Word Sample

From the point of view of entering and editing input text, one of
the most difficult problems is inserting identifying codes within the
text to discriminate among categories of information relevant to the
analysis being done. For example, a linguist might wish to mark each
relative clause or noun phrase with a code so that he could obtain
information on their use as well as information on single words. Because
the complexity of inserting such codes into a running text represents
one of the most difficult problems for any input system to handle, it
was used as a task to test the systems under study.

A 300 word sample representing a hypothetical transcription of
audio-recordings was used for a test. There were copies of two
transcripts in three stages of editing: (1) first typescript from
audio tape; (2) intermediate hand-coded rough; (3) final copy. On
stage 1, examples of simple typing errors (e.g., misspelled words)
were inserted for immediate backspace correction. On stage 2, examples
of types of codings or changes which might occur were inserted. Codings
used were arbitrary and have no actual significance. As many types of
editing problems as possible were incorporated within the example.
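Seen as a data operation, the test task amounts to placing codes
between the words of a running text. The sketch below is one
hypothetical way to express it: the rule of attaching a code after a
given word position, the function name, and the choice of codes are all
assumptions made for illustration, in keeping with the arbitrary
codings used in the test.

```python
# Sketch of the test task: inserting identifying codes into running text.
# The marking rule (a code placed after a given word position) and the
# sample codes are hypothetical; the study's codings were arbitrary too.

def insert_codes(text: str, codes: dict[int, str]) -> str:
    """Insert each code after the word at the given 0-based position."""
    out = []
    for index, word in enumerate(text.split()):
        out.append(word)
        if index in codes:
            out.append(codes[index])
    return " ".join(out)

rough = "THIS GUY, I HE MIGHT,"
coded = insert_codes(rough, {0: "1201.1", 1: "57420", 2: "72100"})
print(coded)
```

Every code inserted shifts all later text along, which is why the task
is hard for media such as cards and paper tape that cannot shift their
contents.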



Example of Sample Used

Stage 1:

    %AH     THIS    GUY     I           HE      MGTH
    HIS,    HS      INTENTINS           MIGHT   BE      GOOD,
    YOU     KNW,    BUT     HE'S        JUST    DOING
    HIS     JOB,    YOU     KNOW?       RICH?   (T7558)
    %AE     TRUE,   I       AGRE        WITH    YOU

Stage 2:

    %AH     THIS    GUY,    I           HE      MIGHT,
    HIS,    HIS,    INTENTIONS          MIGHT   BE      GOOD,
    YOU     KNOW,   BUT     HE'S        JUST    DOING
    HIS     JOB,    YOU     KNOW?       RIGHT?  (T7258)
    %AE     TRUE,   I       AGREE       WITH    YOU

Stage 3:

    $15,051 %Z1 THIS 1201.1 GUY, 57420
    I 72100 HE MIGHT, HIS 72108
    INTENTIONS MIGHT BE GOOD, 57300 57420
    YOU 72208 KNOW, (-BUT +) HE'S 57311
    JUST DOING HIS JOB, YOU 72108 KNOW? 57441
    RIGHT? (T7758)
    %AM TRUE, I AGREE 5200 WITH YOU $$15,051



After arranging with the vendors for a test, they were asked t'o 
prepare .the data, using trained operators and demonstrate each step 
from rough copy to magnetic tape. Following are the results of those 
tests: 

KEYPUNCH ^ 

Tab setting was done by means of a program drum card. Set up for 
this card took approximately three, minutes . Input typing -proceeded at 
a rate of approximately 30 words per minute with an excessive noise 
level. During initial keying two errors were caught and corrected by 
duplicating the card up to the point of the error, correcting the error 
and duplicating the rest of the card. , When the original test deck was 
finished, the operator attempted to perform the editing functions of 
adding codes, and changing* items indicated. After only a few cards, it 
became apparent that almost every card was being retyped i*n order to 
make the corrections, and the Dest was abandoned. 
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TELETYPE "1 
t . 

* - f 

No tab setting was available, Input typing proceeded , at a rate of 
approximately 25 words per minute with some confusion caused by the need 
to multiple space between words to simulate tabs. Noise level was 
•high. Two control character corrections were, used on inHlal keying, 
and proofreading indicated one additional keying error to be corrected 
during editing. There was no take up reel for paper tapes produced. 
Editing proceeded by playing back the original paper tape and attempting 

/ • ■ ' " ' 

to stop it at appropriate points for insertions. After approximately 
fifteen minutes and dozens of attempts, the operator admitted she was 
unable to perform even the Ifcrst insertion and the test was abandoned. 

fJjEXowriter y J 

Tab setting for initial input was smooth and uncomplicated. Input
typing proceeded at a rate of approximately 30 words per minute with a
high but acceptable noise level. There was no take-up reel for the paper
tape produced.

Editing proceeded by playing back the original tape and stopping
at appropriate points for insertions. A new paper tape was punched
during editing. Stopping was accomplished at the proper point in all
but one instance. The entire new tape had to be run through another
pass of the editing procedure for the final correction to be made.

MTST

Tab setting for initial input was smooth and apparently uncomplicated.
Input typing proceeded at a rate of approximately 40 words per minute
with an acceptable noise level. The process of setting up MTST tapes
for editing (rewinding, moving the tape to the other side of the
machine, etc.) took approximately one minute.

Editing was accomplished by playing back the original tape, stopping
at appropriate points and typing in the insertion. No special commands
were required. A single button accomplished playback and only the actual
characters to be inserted were keyed in the editing stage.

As many as 130 characters may be keyed per record initially and
there is automatic creation of a new record if editing insertions cause
an overflow. There is no automatic facility for removing superfluous
blanks.
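
The record behavior described above can be illustrated with a minimal
sketch. It is not a model of the MTST hardware: the function name and
the rule of splitting exactly at the 130-character limit are assumptions
made for illustration, since the test report does not say where the
machine breaks an overflowing record. Blanks are deliberately left
untouched, matching the absence of any automatic blank removal.

```python
RECORD_LIMIT = 130  # characters per record, as reported in the test


def insert_into_record(records, index, position, insertion, limit=RECORD_LIMIT):
    """Insert text into one record and spill any overflow into new records.

    Hypothetical helper: the split point (exactly at `limit`) is an
    assumption; the real machine may break records elsewhere.
    Superfluous blanks are not removed.
    """
    original = records[index]
    edited = original[:position] + insertion + original[position:]
    # Cut the edited record into pieces of at most `limit` characters.
    pieces = [edited[i:i + limit] for i in range(0, len(edited), limit)] or [""]
    # The extra pieces become new records placed right after the edited one.
    return records[:index] + pieces + records[index + 1:]


records = ["A" * 128, "NEXT RECORD"]
updated = insert_into_record(records, 0, 128, "XYZ12")
print([len(r) for r in updated])  # record lengths after the insertion
```

In this sketch a five-character insertion into a 128-character record
overflows the limit, so a new short record is created between the edited
record and the one that followed it.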




DATAPLEX 



Although the task had been specified in advance, Data Instruments
was unable to have a system ready to view. The explanation given was
that computer software specific to the task was required in order to
have their machines process the material. The user was expected to
provide programmer time to write such software, which had to be written
in Data Instruments' own programming language. Assurances were given
that once such software existed, the system could handle any
contingency. There was, of course, the understanding that the
efficiency of the system would be directly dependent on the
sophistication of the software the user produced.

CRT

Tab setting for initial input was smooth and uncomplicated. Input
typing proceeded at approximately 30 words per minute with virtually no
noise. "End of message" codes were keyed after 1500 characters (20
lines). When the 3000 character buffer was filled, the operator dumped
the buffer, one message at a time, onto the cassette attachment. Setting
up the tape took two simple operations. The terminal was switched from
LOCAL to ON-LINE for transfer, and from TYPE to TELETYPE mode. There
were then four transfer commands entered on the keyboard.
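
The relationship between messages and buffer loads described above can
be sketched as follows. The function names are hypothetical, and the
sketch assumes that messages are cut at exactly 1500 characters and
that a buffer load always holds whole messages; the terminal itself is
not modeled.

```python
MESSAGE_SIZE = 1500  # characters keyed before each "End of message" code
BUFFER_SIZE = 3000   # capacity of the terminal buffer, as reported


def split_into_messages(text, message_size=MESSAGE_SIZE):
    """Cut the keyed text into messages of at most `message_size` characters."""
    return [text[i:i + message_size] for i in range(0, len(text), message_size)]


def buffer_loads(text, buffer_size=BUFFER_SIZE, message_size=MESSAGE_SIZE):
    """Group messages into buffer loads (assumed to hold whole messages only).

    Each load would then be dumped to the cassette one message at a time.
    """
    messages = split_into_messages(text, message_size)
    per_buffer = buffer_size // message_size  # two messages per full buffer
    return [messages[i:i + per_buffer] for i in range(0, len(messages), per_buffer)]


keyed_text = "x" * 7000  # placeholder text, not data from the study
for number, load in enumerate(buffer_loads(keyed_text), start=1):
    print(f"buffer load {number}: {[len(message) for message in load]}")
```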

Transfer proceeded smoothly. Hard copy was obtained by playing
back the cassette through a teletype. Setting the tape took two simple
operations.

Data was then transferred from the cassette back onto the Telterm2
for further editing. Insertion was accomplished by moving the cursor
to the appropriate point, pushing the "Insert start" key and typing in
the insert. Editing proceeded swiftly and surely. Superfluous blanks
were removed with the "delete character" key. When editing was completed,
the operator again simulated transfer to cassette.



ATS 

Tab setting for initial input had to be done twice due to operator
error. Input typing proceeded at a rate of approximately 40 words per
minute with an acceptable noise level. There was an editing setup
procedure requiring two commands, taking only a few seconds.

The editing procedure proved to be unacceptable, because the
commands required by the system necessitate typing the full word
occurring before the desired insertion. In this particular
application, with its initial one-to-one expansion, it would be more
efficient to retype.
Several other features of interest were demonstrated, however.
ATS has the capability of scanning for a particular character set
(e.g., a specific linguistic code) and giving immediately the number
of occurrences. It can also print out each line on which such a
character set occurs (roughly equivalent to a one-line KWIC). For the
purposes of initial "browsing" in data, ATS may prove to be highly
cost-effective.
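
The kind of scan described above can be approximated in a few lines.
This is only a sketch of the idea, not the ATS command set: the
function name, the sample lines and the "/N/" and "/V/" codes are all
hypothetical and are not taken from the test data.

```python
def scan(lines, code):
    """Count occurrences of `code` and collect the lines that contain it.

    Returns (total_occurrences, [(line_number, line), ...]); the list is a
    rough analogue of a one-line KWIC listing.
    """
    total = 0
    hits = []
    for number, line in enumerate(lines, start=1):
        count = line.count(code)  # non-overlapping occurrences in this line
        if count:
            total += count
            hits.append((number, line))
    return total, hits


# Hypothetical coded text, for illustration only.
sample = [
    "THE /N/ DOG SAW THE /N/ CAT",
    "IT RAN /V/ AWAY",
    "A /N/ BIRD SANG /V/",
]

total, hits = scan(sample, "/N/")
print(f"{total} occurrences")
for number, line in hits:
    print(f"{number:4d}  {line}")
```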



OCS



Since the preparation and editing of text is not a function of
scanning, no parallel test was made of the OCS system. The operation
of the scanning equipment was reviewed for accuracy and ease of
conversion.



A stack of typed sheets was input to a hopper which fed them into
the scanning section of the machine. Transfer to magnetic tape proceeded
smoothly until a character was encountered which the machine could not
recognize. This character (a smudged "m") was projected on a screen
and the operator entered the correct choice through a teletype. There
were no other errors. The magnetic tape was then played back through a
teletype for final proofreading and hard copy.
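
To put the keying rates observed in the tests in perspective, the
following sketch estimates initial keying time for a data base of
approximately one million words. It covers initial keying only and
assumes the short-test rates would hold over the whole job; setup,
proofreading, editing and operator fatigue are ignored, so the figures
are rough lower bounds rather than results of the study.

```python
TOTAL_WORDS = 1_000_000  # approximate size of the data base

# Approximate input rates (words per minute) reported in the tests above.
RATES_WPM = {
    "KEYPUNCH": 30,
    "TELETYPE": 25,
    "FLEXOWRITER": 30,
    "MTST": 40,
    "CRT": 30,
    "ATS": 40,
}


def keying_hours(total_words, words_per_minute):
    """Hours of uninterrupted keying at a constant rate (an idealization)."""
    return total_words / words_per_minute / 60


for system, rate in RATES_WPM.items():
    hours = keying_hours(TOTAL_WORDS, rate)
    print(f"{system:12s} {rate:3d} wpm  {hours:7.1f} hours")
```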

IV. Conclusions 






After all systems were evaluated, it became clear that two, MTST
and Telterm, were definitely superior for the task of preparing a large
natural language data base which required heavy editing and insertion.
In each of the other systems reviewed, correction of typing errors
caught during entry was either cumbersome, time-consuming, or virtually
impossible without retyping a large section of the data. If the data
was keypunched, a card with an error required repunching. In the case
of the Teletype, such corrections required an on-line edit performed by
a computer. ATS used both a computer and a cumbersome line-referencing
and retyping process. In the cases of MTST and Telterm, such corrections
could be made directly, immediately, and simply.

In most of the other systems, large scale insertions were impossible
or extremely difficult. On Telterm, such insertions could be accomplished
by simply pushing the "insert" button and typing in the new data. On
MTST, insertions could be made by playing the data back to the point at
which the insertion was to begin, typing in the new data, and then
playing back the old data up to the next insertion.

MTST and Telterm have the advantage over TELETYPE, FLEXOWRITER,
DATAPLEX, and ATS in that they can do all the required data preparation
and editing off-line, thus requiring no costly computer time.


Both MTST and Telterm are significantly better than the other systems.
Both have equal flexibility and editing capabilities for text input.
Any final decision on which of the two is the more effective method for
a particular project requiring text preparation must be based on a
number of factors not directly examined in this study, but relating to
the project itself, its task, funding and scope.

If the particular project is one which requires hard copy at every
step, MTST would be preferable, as it would be if the researcher is in
a non-computer oriented environment, where he would have no use for the
terminal capabilities of the Telterm after completion of data preparation,
but could use the MTST in normal secretarial routines.

If, on the other hand, the project is in a computer oriented
environment where the terminal capabilities would ensure continued use
of the terminal, Telterm would be the logical choice. Likewise, all
other factors being equal, Telterm would be the preferred device, since
MTST costs $2,875 per year more than Telterm on a one year lease
basis, and $7,010 more on a purchase basis.

Above all else, one fact emerges clearly from this study. The
rapid proliferation of sophisticated input techniques has assured the
researcher of natural language tools which eliminate previous roadblocks
and allow him to concentrate his energies to a greater extent on the
research itself, rather than the mechanics of data preparation.



